This course will be an introduction to R and GitHub.
Here is the link to my IODS page: link
setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")
learning2014 <- read.csv("learning2014.csv", header=TRUE)
str(learning2014)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ Deep : int 43 35 42 42 44 57 46 39 52 48 ...
## $ Stra : int 27 22 29 25 29 29 18 32 34 28 ...
## $ Surf : int 31 38 27 27 34 29 23 34 26 36 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166 8
The “learning2014” dataset consists of 166 observations and 8 variables. The variables (gender, Age, Attitude, Deep, Stra, Surf and Points) describe the results of a survey study conducted as an international collaboration project with teachers.
summary(learning2014)
## X gender Age Attitude Deep
## Min. : 1.00 F:110 Min. :17.00 Min. :14.00 Min. :19.00
## 1st Qu.: 42.25 M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:40.00
## Median : 83.50 Median :22.00 Median :32.00 Median :44.00
## Mean : 83.50 Mean :25.51 Mean :31.43 Mean :44.16
## 3rd Qu.:124.75 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:49.00
## Max. :166.00 Max. :55.00 Max. :50.00 Max. :59.00
## Stra Surf Points
## Min. :10.00 Min. :19.00 Min. : 7.00
## 1st Qu.:21.00 1st Qu.:29.00 1st Qu.:19.00
## Median :25.50 Median :34.00 Median :23.00
## Mean :24.97 Mean :33.45 Mean :22.72
## 3rd Qu.:29.00 3rd Qu.:38.00 3rd Qu.:27.75
## Max. :40.00 Max. :52.00 Max. :33.00
There are 110 female and 56 male students. The median age is 22 years (range [17-55]). Students scored a median of 44 points [19-59] on the deep approach (Deep), 25.5 [10-40] on the strategic approach (Stra) and 34 [19-52] on the surface approach (Surf). The median exam score (Points) was 23 [7-33].
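The counts and medians quoted above can be reproduced with a few base-R calls; a minimal sketch on a toy data frame shaped like learning2014 (the values below are made up, not the real survey data):

```r
# Toy data frame shaped like learning2014 (made-up values, not the survey)
toy <- data.frame(
  gender = factor(c("F", "M", "F", "M")),
  Age    = c(22, 25, 30, 21),
  Points = c(24, 12, 25, 22)
)
counts <- table(toy$gender)      # gender counts, like the 110 F / 56 M split
med    <- median(toy$Points)     # median, like the 23 points reported above
rng    <- range(toy$Points)      # [min, max], like the [7-33] range
```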
pairs(learning2014[-1], col=learning2014$gender)
The data is visualised according to gender (males as black dots and females as red dots) in multiple bivariate scatter plots.
The data can also be visualised using the “GGally” and “ggplot2” packages.
library("GGally")
## Warning: package 'GGally' was built under R version 3.3.2
library("ggplot2")
## Warning: package 'ggplot2' was built under R version 3.3.2
p <- ggpairs(learning2014[-1], mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
Now we can see that male students score higher on the attitude questions and female students on the strategic questions. We can also see that the variables Points and Attitude have the highest correlation (cor: 0.43) and Surf and Deep the second highest (cor: 0.32).
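The correlation coefficients shown in the ggpairs panels can also be computed directly with cor(); a sketch on toy columns standing in for Attitude and Points (made-up data, not the survey):

```r
# Toy stand-ins for Attitude and Points (made-up values)
set.seed(1)
attitude <- rnorm(50)
points   <- 0.4 * attitude + rnorm(50)
r <- cor(points, attitude)   # Pearson correlation, as shown in the panels
r_rounded <- round(r, 2)     # ggpairs prints it rounded like this
```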
The exam points (Points) students achieve are modelled with a multiple linear regression using Attitude, Stra and Surf as explanatory variables. The model describes the linear relationship between the explanatory variables and the dependent variable.
points_regression <- lm(Points ~ Attitude + Stra + Surf, data = learning2014)
summary(points_regression)
##
## Call:
## lm(formula = Points ~ Attitude + Stra + Surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## Attitude 0.33952 0.05741 5.913 1.93e-08 ***
## Stra 0.10664 0.06770 1.575 0.11716
## Surf -0.04884 0.06678 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
In the first model with all three variables, Attitude explains Points with a coefficient estimate of 0.34 (p < 0.0001). The other variables, Stra and Surf, do not explain Points at a statistically significant level. Thus, Surf (the variable with the largest p-value) is removed first and the model is fitted again.
points_regression <- lm(Points ~ Attitude + Stra, data = learning2014)
summary(points_regression)
##
## Call:
## lm(formula = Points ~ Attitude + Stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## Attitude 0.34658 0.05652 6.132 6.31e-09 ***
## Stra 0.11421 0.06681 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
Attitude is still the only statistically significant variable, so Stra is also removed and the model is fitted with Attitude alone (coefficient 0.35, p < 0.0001).
points_regression <- lm(Points ~ Attitude, data = learning2014)
summary(points_regression)
##
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The adjusted R-squared is 0.19, meaning that in this model Attitude explains about 19% of the total variance of the dependent variable Points.
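The numbers quoted here can be extracted programmatically from the model object; a sketch on toy data (the real model is Points ~ Attitude, the toy variables x and y are made up):

```r
# Toy linear data standing in for Attitude and Points
set.seed(42)
x <- rnorm(100)
y <- 0.35 * x + rnorm(100)
fit <- lm(y ~ x)
s <- summary(fit)
slope <- coef(fit)[["x"]]   # coefficient estimate (0.35 in the text)
r2adj <- s$adj.r.squared    # adjusted R-squared (0.19 in the text)
ci    <- confint(fit)       # 95% confidence intervals for both terms
```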
plot(points_regression, which = c(1,2,5))
The residuals of a linear regression model describe the validity of the model. Residuals should be normally distributed, they should not correlate with each other, they should have constant variance, and their size should not depend on the fitted values.
The Q-Q plot describes the distribution of the residuals. In our model the residuals are approximately normally distributed, as the points fall on a straight increasing line.
The Residuals vs Fitted plot shows how the residuals depend on the fitted values. According to the plot, the residuals are reasonably evenly distributed on both sides of the residual = 0 line and thus do not depend on the size of the fitted values.
The Residuals vs Leverage plot helps identify observations with an unusually high impact on the model. According to the plot, all leverage values fall within [0, 0.05], so no single observation has an unusual impact on the model.
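The leverage values underlying the Residuals vs Leverage plot can also be inspected directly with hatvalues(); a sketch on toy data (made-up variables, not the learning2014 model):

```r
# Toy regression; leverage = diagonal of the hat matrix
set.seed(7)
x <- rnorm(30)
y <- x + rnorm(30)
fit <- lm(y ~ x)
lev <- hatvalues(fit)   # one leverage value per observation
# Leverages always sum to the number of estimated coefficients (here 2)
total <- sum(lev)
```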
setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")
alc <- read.csv("alc.csv", header = TRUE)
colnames(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "nursery" "internet" "guardian"
## [16] "traveltime" "studytime" "failures" "schoolsup" "famsup"
## [21] "paid" "activities" "higher" "romantic" "famrel"
## [26] "freetime" "goout" "Dalc" "Walc" "health"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use"
dim(alc)
## [1] 382 36
The data consists of 382 observations and 36 variables. The variable names are listed above; they describe each student’s alcohol use together with social, demographic and study-related attributes.
Data Set Information:
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
Attribute Information:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 nursery - attended nursery school (binary: yes or no)
13 internet - Internet access at home (binary: yes or no)
14 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
15 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
16 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
17 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
18 schoolsup - extra educational support (binary: yes or no)
19 famsup - family educational support (binary: yes or no)
20 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
21 activities - extra-curricular activities (binary: yes or no)
22 higher - wants to take higher education (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
34 alc_use - average of ‘Dalc’ and ‘Walc’
35 high_use - TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise
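As items 34-35 describe, alc_use and high_use are derived columns; a sketch of how they are typically constructed (the rows below are toy values, not the real data):

```r
# Toy rows on the Dalc/Walc scales (1 = very low ... 5 = very high)
toy <- data.frame(Dalc = c(1, 3, 5), Walc = c(1, 4, 5))
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2   # item 34: mean of the two
toy$high_use <- toy$alc_use > 2             # item 35: TRUE above 2
```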
Hypothesis: a high number of ‘absences’, no ‘famsup’, low ‘studytime’ and bad ‘famrel’ correlate with higher alcohol consumption.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
g1 <- ggplot(alc, aes(x = alc_use, y = absences, fill = sex))
g1 + geom_bar(stat="identity") + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")
g2 <- ggplot(alc, aes(x = alc_use, y = as.numeric(famsup == "yes"), fill = sex))
g2 + geom_bar(stat="identity") + ylab("family educational support") + ggtitle("Student family educational support by alcohol consumption")
g3 <- ggplot(data = alc, aes(x = alc_use, y = studytime, fill=sex))
g3 + geom_bar(stat="identity") + ylab("weekly study time") + ggtitle("Student weekly study time by alcohol consumption and sex")
g4 <- ggplot(alc, aes(x = alc_use, y = famrel, fill = sex))
g4 + geom_bar(stat="identity") + ylab("quality of family relationships") + ggtitle("Student quality of family relationships by alcohol consumption and sex")
alc %>% group_by(famsup) %>% summarise(count = n(), mean_alc_use = mean(alc_use))
## # A tibble: 2 × 3
## famsup count mean_alc_use
## <fctr> <int> <dbl>
## 1 no 144 1.965278
## 2 yes 238 1.842437
alc %>% group_by(alc_use) %>% summarise(count = n(), mean_absences = mean(absences))
## # A tibble: 9 × 3
## alc_use count mean_absences
## <dbl> <int> <dbl>
## 1 1.0 140 3.357143
## 2 1.5 69 4.231884
## 3 2.0 59 3.915254
## 4 2.5 44 6.431818
## 5 3.0 32 6.093750
## 6 3.5 17 5.647059
## 7 4.0 9 6.000000
## 8 4.5 3 12.000000
## 9 5.0 9 6.888889
alc %>% group_by(alc_use) %>% summarise(count = n(), mean_studytime = mean(studytime))
## # A tibble: 9 × 3
## alc_use count mean_studytime
## <dbl> <int> <dbl>
## 1 1.0 140 2.307143
## 2 1.5 69 1.971014
## 3 2.0 59 1.983051
## 4 2.5 44 1.931818
## 5 3.0 32 1.718750
## 6 3.5 17 1.470588
## 7 4.0 9 1.777778
## 8 4.5 3 2.000000
## 9 5.0 9 1.666667
alc %>% group_by(famrel) %>% summarise(count = n(), mean_alc_use = mean(alc_use))
## # A tibble: 5 × 3
## famrel count mean_alc_use
## <int> <int> <dbl>
## 1 1 8 2.250000
## 2 2 19 2.184211
## 3 3 64 2.000000
## 4 4 189 1.880952
## 5 5 102 1.750000
Lack of family educational support and bad quality of family relationships seem to correlate with higher alcohol consumption. Additionally, high alcohol consumption correlates with a high number of absences from school. Students with high alcohol consumption (alc_use > 3.0) also study less than students with low alcohol consumption (alc_use < 2.5). These findings are concordant with the primary hypotheses.
alc$famrel <- factor(alc$famrel)
alc$studytime <- factor(alc$studytime)
m <- glm(high_use ~ absences + famsup + alc$studytime + alc$famrel, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + famsup + alc$studytime +
## alc$famrel, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1803 -0.8226 -0.6503 1.1522 2.2409
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.015455 0.878444 -1.156 0.247693
## absences 0.076934 0.022630 3.400 0.000675 ***
## famsupyes -0.061188 0.245702 -0.249 0.803337
## alc$studytime2 -0.439698 0.267726 -1.642 0.100520
## alc$studytime3 -1.342617 0.442054 -3.037 0.002388 **
## alc$studytime4 -1.279644 0.588173 -2.176 0.029583 *
## alc$famrel2 0.936030 0.964897 0.970 0.332005
## alc$famrel3 0.643055 0.883261 0.728 0.466585
## alc$famrel4 0.272456 0.858654 0.317 0.751012
## alc$famrel5 -0.006842 0.876748 -0.008 0.993773
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 427.24 on 372 degrees of freedom
## AIC: 447.24
##
## Number of Fisher Scoring iterations: 4
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.3622376 0.04877998 1.8120704
## absences 1.0799706 1.03533984 1.1316272
## famsupyes 0.9406468 0.58232256 1.5284716
## alc$studytime2 0.6442312 0.38110876 1.0908572
## alc$studytime3 0.2611613 0.10353967 0.5966021
## alc$studytime4 0.2781364 0.07612213 0.8066302
## alc$famrel2 2.5498396 0.42150543 21.4386194
## alc$famrel3 1.9022832 0.37564722 14.2026417
## alc$famrel4 1.3131852 0.27385709 9.4789868
## alc$famrel5 0.9931812 0.19882276 7.3471751
According to the logistic regression model, a higher number of absences correlates with higher alcohol consumption (OR 1.08 per absence), whereas studying more than 5 hours a week correlates with lower alcohol consumption. Family educational support and quality of family relationships do not correlate significantly with alcohol consumption in this model.
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 253 15
## TRUE 94 20
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table()
## prediction
## high_use FALSE TRUE
## FALSE 0.66230366 0.03926702
## TRUE 0.24607330 0.05235602
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66230366 0.03926702 0.70157068
## TRUE 0.24607330 0.05235602 0.29842932
## Sum 0.90837696 0.09162304 1.00000000
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2853403
According to the results, 29% of the predictions were wrong. This error rate is clearly lower than that of simple guessing (50% of predictions wrong), although not optimal.
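The same loss function also benchmarks a trivial classifier: predicting FALSE for everyone is wrong exactly as often as high_use is TRUE (about 30% in this data), so 29% is only a modest improvement on that baseline. A sketch on a toy class vector (made up, not the real high_use column):

```r
# Same loss function as above, applied to constant predictions
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
class <- c(TRUE, FALSE, FALSE, TRUE, FALSE)   # toy high_use vector
loss_all_false <- loss_func(class, 0)   # share of TRUE cases missed
loss_all_true  <- loss_func(class, 1)   # share of FALSE cases missed
```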
library(boot)
#cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
#cv
#cv$delta[1]
When the cross-validation above is run, the test-set performance is worse, with about 31% of predictions wrong.
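A runnable sketch of the cross-validation commented out above, on toy logistic-regression data (the variables x and y are made up; set.seed makes the fold split reproducible):

```r
# Toy logistic-regression data (made-up variables)
library(boot)
set.seed(123)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x))
fit <- glm(y ~ x, data = d, family = "binomial")
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
# 10-fold cross-validation; delta[1] is the average prediction error
cv <- cv.glm(data = d, cost = loss_func, glmfit = fit, K = 10)
cv_error <- cv$delta[1]
```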
#setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston data frame has 506 rows and 14 columns. The data frame contains the following variables: crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat and medv.
pairs(Boston)
library(dplyr)
cor_matrix<-cor(Boston)
cor_matrix
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593
## ptratio black lstat medv
## crim 0.2899456 -0.38506394 0.4556215 -0.3883046
## zn -0.3916785 0.17552032 -0.4129946 0.3604453
## indus 0.3832476 -0.35697654 0.6037997 -0.4837252
## chas -0.1215152 0.04878848 -0.0539293 0.1752602
## nox 0.1889327 -0.38005064 0.5908789 -0.4273208
## rm -0.3555015 0.12806864 -0.6138083 0.6953599
## age 0.2615150 -0.27353398 0.6023385 -0.3769546
## dis -0.2324705 0.29151167 -0.4969958 0.2499287
## rad 0.4647412 -0.44441282 0.4886763 -0.3816262
## tax 0.4608530 -0.44180801 0.5439934 -0.4685359
## ptratio 1.0000000 -0.17738330 0.3740443 -0.5077867
## black -0.1773833 1.00000000 -0.3660869 0.3334608
## lstat 0.3740443 -0.36608690 1.0000000 -0.7376627
## medv -0.5077867 0.33346082 -0.7376627 1.0000000
cor_matrix %>% round(2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
#install.packages("corrplot")
library(corrplot)
corrplot(cor_matrix %>% round(2), method="circle")
corrplot(cor_matrix %>% round(2), method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
There seems to be a strong negative correlation between age (proportion of owner-occupied units built prior to 1940) and dis (weighted mean of distances to five Boston employment centres), between nox (nitrogen oxides concentration) and dis, between indus (proportion of non-retail business acres per town) and dis, and between lstat (lower status of the population) and medv (median value of owner-occupied homes in $1000s). On the other hand, there are strong positive correlations between indus and nox, indus and tax (full-value property-tax rate per $10,000), nox and age, nox and tax, rm (average number of rooms per dwelling) and medv, and rad (index of accessibility to radial highways) and tax.
Comparing the medians and means, and their position relative to the min. and max. values, suggests that the variables crim, zn, age, rad, tax, indus and black are not normally distributed, while the others might be. To draw such conclusions, however, one should first run a formal normality test (for instance Shapiro-Wilk or Kolmogorov-Smirnov).
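Such a test is easy to run on the actual Boston columns; a sketch using Shapiro-Wilk (ks.test against a fitted normal would be the Kolmogorov-Smirnov variant):

```r
library(MASS)   # for the Boston data
data("Boston")
# Shapiro-Wilk test: a small p-value rejects normality
p_crim <- shapiro.test(Boston$crim)$p.value   # crim is heavily right-skewed
crim_not_normal <- p_crim < 0.05
```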
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Scaling subtracts the column mean from each value and divides by the column standard deviation. After scaling, each variable has a mean of 0 and a standard deviation of 1, which makes the distributions easier to compare: the closer the median is to 0, the more symmetric the variable. Unlike a logarithmic transformation, scaling is linear and does not change the shape of the distribution.
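The scaling formula can be verified by hand on one Boston column:

```r
library(MASS)
data("Boston")
scaled <- scale(Boston)
# scale() centers by the column mean and divides by the column sd
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
same <- isTRUE(all.equal(as.numeric(scaled[, "crim"]), manual))
```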
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
scaled_crim <- boston_scaled$crim
summary(scaled_crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419400 -0.410600 -0.390300 0.000000 0.007389 9.924000
bins <- quantile(scaled_crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
#remove original crim from the dataset
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
# lda() is provided by the MASS package, which is already loaded above;
# the CRAN package "lda" is an unrelated topic-modelling package
lda.fit <- lda(crime~., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2400990 0.2500000 0.2648515 0.2450495
##
## Group means:
## zn indus chas nox rm
## low 0.9367670 -0.9192932 -0.10997442 -0.8709345 0.47879112
## med_low -0.1253317 -0.2571739 -0.03844192 -0.5692501 -0.15762951
## med_high -0.3814499 0.1717640 0.09562423 0.3712930 0.04404255
## high -0.4872402 1.0149946 -0.11325431 1.0622652 -0.40753554
## age dis rad tax ptratio
## low -0.8365559 0.8354023 -0.6965304 -0.7567854 -0.41041369
## med_low -0.3657588 0.3475316 -0.5395408 -0.4385409 -0.05583518
## med_high 0.4217848 -0.3554530 -0.4430577 -0.3355635 -0.24494840
## high 0.8032995 -0.8336607 1.6596029 1.5294129 0.80577843
## black lstat medv
## low 0.38455308 -0.77147434 0.54064282
## med_low 0.31406046 -0.11525783 -0.02315994
## med_high 0.05217546 0.05100453 0.12792499
## high -0.87646727 0.86583729 -0.77280345
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12048851 0.731923803 -1.0135288
## indus 0.04685408 -0.339015574 0.2724238
## chas -0.00582327 -0.029055786 0.1114995
## nox 0.27652793 -0.778624791 -1.3301345
## rm 0.04921114 0.007748152 -0.0944418
## age 0.23447836 -0.291399166 -0.3216750
## dis -0.08298416 -0.332508100 0.1107297
## rad 3.68954852 0.833193212 -0.1716712
## tax -0.01324991 0.117864760 0.7763480
## ptratio 0.15271757 -0.027330164 -0.3974084
## black -0.11911434 0.033326600 0.1346439
## lstat 0.11398414 -0.325734336 0.3741871
## medv -0.02968968 -0.548674036 -0.2386007
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9549 0.0327 0.0124
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 19 9 2 0
## med_low 4 16 5 0
## med_high 0 4 13 2
## high 0 0 1 27
According to the results, the model predicts the true values of the categorical crime variable fairly well. Most of the errors occur in the low and med_low groups, where the class-wise accuracy is lowest.
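The overall accuracy can be computed from the cross-tabulation above (the counts below are copied from that table; they will differ between runs because the train/test split is random):

```r
# Counts copied from the cross-tabulation above
# (rows = correct class, columns = predicted class)
tab <- matrix(c(19,  9,  2,  0,
                 4, 16,  5,  0,
                 0,  4, 13,  2,
                 0,  0,  1, 27), nrow = 4, byrow = TRUE)
accuracy <- sum(diag(tab)) / sum(tab)   # correctly predicted share
```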
#data('Boston')
#boston_scaled <- scale(Boston)
dist_eu <- dist(boston_scaled)   # the crime factor is coerced to NA here, hence the warning
## Warning in dist(boston_scaled): NAs introduced by coercion
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1394 3.5270 4.9080 4.9390 6.2420 13.0000
dist_man <- dist(boston_scaled, method = "manhattan")
## Warning in dist(boston_scaled, method = "manhattan"): NAs introduced by
## coercion
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b')
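Note that kmeans is normally applied to the data matrix itself rather than to a distance matrix; a self-contained sketch repeating the elbow search directly on freshly scaled Boston data (with a seed for reproducibility):

```r
library(MASS)
data("Boston")
set.seed(123)
bs <- scale(Boston)   # freshly scaled numeric matrix (no crime factor)
k_max <- 10
# kmeans on the data matrix; nstart guards against poor random starts
twcss <- sapply(1:k_max, function(k) kmeans(bs, k, nstart = 10)$tot.withinss)
# the "elbow" of this curve suggests a suitable number of clusters
```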
km <-kmeans(dist_eu, centers = 2)
pairs(Boston, col = km$cluster)
km <-kmeans(dist_eu, centers = 3)
lda.fit <- lda(km$cluster~., data = boston_scaled)
lda.fit
## Call:
## lda(km$cluster ~ ., data = boston_scaled)
##
## Prior probabilities of groups:
## 1 2 3
## 0.4426877 0.2075099 0.3498024
##
## Group means:
## zn indus chas nox rm age
## 1 -0.2309344 -0.4551440 -0.27232907 -0.4915087 -0.1311176 -0.2883074
## 2 1.3140079 -0.9068835 0.47759479 -0.7455383 1.0548777 -0.7434166
## 3 -0.4872402 1.1139831 0.06132349 1.0642908 -0.4598408 0.8058735
## dis rad tax ptratio black lstat
## 1 0.2864417 -0.5768315 -0.6147205 -0.006886361 0.3154274 -0.2702405
## 2 0.7952165 -0.5935800 -0.6372993 -0.976736455 0.3398644 -0.8793299
## 3 -0.8342410 1.0821251 1.1560103 0.588134874 -0.6007995 0.8636357
## medv crimemed_low crimemed_high crimehigh
## 1 -0.02274038 0.43750000 0.2589286 0.0000000
## 2 1.16160305 0.16190476 0.2761905 0.0000000
## 3 -0.66030777 0.06214689 0.2203390 0.7175141
##
## Coefficients of linear discriminants:
## LD1 LD2
## zn 0.172610013 -1.25618179
## indus 1.078970813 -0.12226466
## chas 0.004463932 -0.49433521
## nox 0.921052125 -0.16586799
## rm -0.040950198 -0.49922411
## age -0.067015888 0.07618846
## dis 0.138379223 -0.06104503
## rad 0.579745765 0.43033468
## tax 0.401343141 -0.66310399
## ptratio 0.374368029 0.11727097
## black -0.061203934 0.05315043
## lstat 0.376322744 -0.56677965
## medv 0.109367758 -0.69236403
## crimemed_low -0.389791316 0.15910228
## crimemed_high -0.486827228 -0.58792893
## crimehigh -0.207144258 -1.08365331
##
## Proportion of trace:
## LD1 LD2
## 0.7951 0.2049
plot(lda.fit, dimen = 2)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)
Using 3 clusters in the K-means analysis, the most influential linear separators are nox and chas. This means that if the data were divided into 3 classes, the variables nox and chas would explain most of the between-class difference and would thus be the best variables for predicting possible new observations.
model_predictors <- dplyr::select(train, -crime)
###Check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 16 2
###Matrix multiplication
#matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
#matrix_product <- as.data.frame(matrix_product)
#install.packages("plotly")
#library("plotly")
###Create a 3D plot (Cool!) of the columns of the matrix product
#plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)
#plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=km$center)
The same results as above are visualised in 3D, using the variable crime to colour observations in the first plot and the k-means clusters in the second. For some reason, the visualisation does not work after knitting the file.